A METHOD OF IMPLEMENTING ONE-TO-ONE BINARY FUNCTION 
AND RELATIVE HARDWARE DEVICE, ESPECIALLY FOR 
A RIJNDAEL S-BOX 

Field of -the Invention 
[0001] The present invention relates to implementing 
a one-to-one binary function that is particularly 
suited for forming S-box devices performing the ByteSub 
operation of the Rijndael AES encryption/decryption 
algorithm. 

Background of tiie Invention 
[0002] In devices implementing encryption/decryption 
algorithms it is necessary to perform binary functions 
over an input set of bytes to be encrypted/decrypted 
for generating corresponding output bytes. These 
operations include a one-to-one binary function that is 
implemented by logic circuitry. . This logic circuitry 
is required to be fast, consume low power and occupy a 
small silicon area. 

[0003] Because of the importance of the Rijndael AES 
encryption/decryption algorithm, the problem to be 
addressed is presented with respect to this algorithm, 
but the same considerations hold also for any binary 
one-to-one function . 

[0004] A brief overview of AES will now be provided. 
In January 2, 1997, the National Institute of Standards 
and Technology (NIST) announced the beginning of the 
development of the Advanced Encryption Standard (AES) . 
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The overall goal was to develop a Federal Information 
Processing Standard (FIPS) that specified an encryption 
algorithm capable of protecting sensitive 
(unclassified) government information into the twenty- 
first century. 

[0005] The formal call for algorithms was made on 
September 12, 1997. The algorithms were required to 
implement symmetric key cryptography as a block cipher 
and to support a block size of 128 bits, and key sizes 
of 128, 192 and 256 bits. In August 20, 1998, NIST 
announced fifteen AES candidate algorithms at the First 
AES Candidate Conference, and solicited public comments 
on the candidates. A Second AES Candidate Conference 
was held in March 1999 to discuss the results of the 
analysis that was conducted by the international 
cryptographic community on the candidate algorithms. In 
August 1999, NIST announced its selection of five 
finalist algorithms from the fifteen candidates. The 
selected algorithms were MARS, RC6, Rijndael, Serpent 
and Twofish. 

[0006] A lot of attention had been put on the 
complexity of the algorithms. A good AES algorithm was 
required to be easily implemented on general purpose 
processors and on reconf igurable hardware, and light 
from a computational point of view. 

[0007] NIST judged the Rijndael algorithm to be the 
best algorithm for the AES at the end of a very long 
and complex evaluation process in which all public 
comments, papers, verbal comments at conferences, NIST 
studies and reports had been analyzed. The official 
announcement was made on October 2, 2000 and the 
standard was completed with the publication of FIPS-197 

[1] on November 26, 2001. 
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[0008] Many ways of forming devices for implementing 
efficiently the Rijndael AES encryption/decryption 
algorithm have been investigated- Papers (see for 
instance [2, 3]) that describe implementations of the 
AES algorithm on field programmable gate arrays (FPGAs) 
are present in the technical literature. Few of them 
(see [4]) describe an implementation of the Rijndael 
algorithm on application specific integrated circuit 
(ASIC) platforms. 

[0009] A custom but still flexible implementation, 
however, would be desirable in high speed or embedded 
dedicated cores in which fast and/or low-power 
computation is desirable. The consumed silicon area 
for forming an electronic circuit that performs the 
steps of the Rijndael algorithm is an important 
parameter for custom implementations. The smaller the 
consumed area, the lower the unit cost, and thus the 
higher the number of hardware devices produced on the 
same silicon die. 

[0010] It has been observed [5] that the realization 

of the so-called S-box, which is a hardware device that 
performs the byte substitution operation (ByteSub) 
contemplated by the algorithm, is critical for reducing 
the area consumption. This operation is the bottle 
neck of the algorithm because it must be repeated many 
times and it is implemented by a one-to-one nonlinear 
binary function. Moreover, given that this function is 
nonlinear, it is not possible to form a hardware device 
for performing it by using standard synthesis 
techniques . 

[0011] The ByteSub operation is a composition of two 
binary functions defined on bytes. The first function 
is an inversion in the finite field (or Galois Field) 



GF {2 ) , which is a field composed of bytes, while the 
second is an affine function. Optimum security 
properties are obtained for the entire cipher system by 
combining in cascade these two functions (see for 
instance [6] and [7]). 

[0012] The first function is more complex than the 

second from a computational point of view because it 
behaves almost like a purely random function. For this 
reason^ the S-box is generally synthesized starting 
from the complete truth table of the function. A 
behavioral table-like description is provided to a VHDL 
compiler, and the corresponding combinatorial function 
is extracted and synthesized with logic gates. 
[0013] An alternative approach includes considering 
the ByteSub operation as a function defined on the 
composite finite field GF((2^)^). The input byte is 
separated into two nibbles and the problem of inverting 
the first function is reduced to the problem of 
inverting a function in the inner field GF(2^) which is 
smaller. This method has been originally proposed by 
Rijmen in [8], and further developed in [9]. See also 
[10] V that contains performance estimations of the 
obtained circuit when implemented by the HCCM0S7 
technology library of STMicroelectronics . A further 
study on the optimization of the S-box has been 
proposed in [11], mainly addressing speed issues. 
[0014] Another important issue in custom 
implementations is power consumption. Although many 
techniques are described in technical literature for 
reducing power consumption at the transistor level and 
at higher levels, still a hand-made analysis and design 
is often useful to produce low power custom units. 
Synthesis tools have features for power optimization. 
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but still most of the power can be saved by human 
analysis and by giving the VHDL compiler a good 
starting point, 

[0015] Moreover, power optimization is often in 
contradiction with other design goals, such as a small 
chip area and high speed, high frequency operation. A 
low power approach for designing the Rijndael S-box is 
discussed in [12] . The authors start from the compact 
implementation discussed in [8], exploit a positive 
polarity reed-muller (PPRM) technique, and add delay 
chains to reduce power consumption and glitches of the 
architecture . 

S-unmiary of the Inven-bion 
[0016] In view of the foregoing background, an 
object of the invention is to provide a method for 
implementing one-to-one binary functions defined on the 
Galois field GF(2^). This method is very useful for 
forming fast and low power hardware devices regardless 
of the binary function- In particular, the method of 
the invention is particularly suited for forming 
electronic circuits that implement the Rijndael 
encrypt ion /decrypt ion algorithm. 

[0017] More precisely, the method generates output 
bytes corresponding to respective input bytes according 
to a one-to-one binary function, comprising the steps 
of decoding an input byte generating at least a bit 
string that contains only one active bit, logically 
combining the bits of the bit string according to the 
binary function for generating a 256-bit string 
representing a corresponding output byte, and encoding 
the 256-bit string in a byte for obtaining the output 
byte. 



- 6 - 

[0018] This method may be implemented by a fast and 
small area consuming hardware device for generating 
output bytes corresponding to respective input bytes 
according to a one-to-one binary function. The 

hardware device may comprise a decoder of the input 
byte for generating at least a bit string that contains 
only one active bit. An array of logic gates may be 
input with the bit string for generating a 256-bit 
string by logically combining the bits of the input 
string according to the one-to-one binary function. An 
encoder may be input with the 256-bit string for 
generating the output byte. 

Brief Description of -bhe Drawincfs 
[0019] The different aspects and advantages of the 
invention will appear even more evident through a 
detailed description referring to the attached 
drawings , wherein : 

[0020] Figure 1 depicts an embodiment of the 
hardware device of the invention; 

[0021] Figure 2 illustrates a straightforward 

architecture of the decoder of Figure 1; 

[0022] Figure 3 illustrates the preferred embodiment 

of the hardware device of the invention; 
[0023] Figure 4 shows a detailed view of the left 
and right decoders of Figure 3; 

[0024] Figure 5 compares power-delay performances of 
different AES S-box architectures with that of the S- 
box of the invention; 

[0025] Figure 6 compares power-delay performances 
for different composite field AES S-box architectures 
of the invention; 

[0026] Figure 7 compares silicon area requirements 



of different AES S-box architectures of the invention; 
[0027] Figure 8 compares the power consumption of 
the eight S-boxes implementing the DES algorithm 
formed according to the method of the invention and 
with a table description technique. 

De-bailed Descrip-bion of t:he Preferred Embodimen-bs 
[0028] In the ensuing description, the method of the 

invention is described while referring to the AES S-box 
as an example. However, what is stated below does not 
depend exclusively on the particular binary functions 
of the Rijndael algorithm. In contrast, the method can 
be immediately extended to any one-to-one binary 
function. This method also includes variations that 
achieve different levels of efficiency. They are all 
listed in order, starting from the simplest to the most 
sophisticated one. 

[0029] As discussed above, a careful analysis of a 
Boolean function to be implemented often leads to high 
power savings. The analysis can be made automatically 
by design tools, or by the designer himself at a high 
level of complexity. The S-box implements a nonlinear 
byte substitution function that has good statistical 
properties. The input-output correlation is kept as low 
as possible and the function is very similar to a 
random function. 

[0030] Recently, an analysis that sheds some doubt 
on the designers' claim for robustness of the nonlinear 
part has been published (see [13]). Moreover, there is 
no evidence that this result can be useful for area or 
power reduction in hardware designs, although it can be 
a good starting point for a cryptanalytic attack of the 
cipher algorithm it is contained in. 
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[0031] The ByteSub operation performed by the S-box 
is nothing but a permutation of all the elements of the 
Galois finite field GF(2^), i.e., it is a byte 
permutation. Hence, this operation is a one-to-one 
mapping . 

[0032] Therefore, instead of computing the input- 
output function in a standard way, according to the 
invention, it is possible to do it by decoding the 
input byte for generating at least a bit string 
containing only 1 active bit, for example a 256-bit 
string or two 16-bit strings, then rearranging the bits 
according to the one-to-one binary function for 
obtaining a 256-bit string, as depicted in Figure 1. 
This may be done without any power dissipation . 
Finally, the output byte is obtained by encoding back 
to a byte the 256-bit string obtained after having 
changed the order of the bits. Note that the 
rearrangement of the wires is not strictly necessary, 
because the same result may be obtained by using an 
array of logic gates generating the bits of the 256-bit 
input it in the encoder. 

[0033] The decoder may be designed using AND and 
inverter logic gates, which are present in the 
technology library. Different options are available. 
One approach includes designing it as a single binary 
tree using only two-inputs AND cells, as in Figure 2. 
Note that each input bit wire must be available in 
normal and complemented form and it enables a single 
subtree. The number of AND logic gates that are 
switching when the input sequence changes is at most 14 
because the worst case occurs when the two decoded 
lines belong to two different sub-trees. In this case 
7 gates switch off and 7 gates switch on. The maximum 
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number of inverter gates that may switch is 8 . 
[0034] The drawback of this implementation is that 
power consumption is highly affected by high net fan- 
outs and the input bit wires, and must face an 
increasing fan-out as they are closer to the output of 
the decoder. For instance, bit 0 and 1 have a fan-out 
of only 4 AND gates, while bit 7 has a fan-out of 256 
AND gates (one half of them pertaining to the normal 
form and the other half to the complemented form) . 
[0035] Another approach includes providing a single 
level of 256 eight-inputs AND components, one per each 
decoded line. This approach is faster, but each input 
line has a fan-out of 256 gates. Latency is only that 
of one eight-input AND gate, which is normally faster 
than 6 cascaded two-inputs AND gates. In this case 
just 2 AND gates are switching when the input sequence 
changes . 

[0036] The preferred approach includes forming a 
hardware device for implementing a one-to-one binary 
function with a two-level architecture, as shown in 
Figure 3. The left decoder block and the right decoder 
block have the same internal structure. Their task is 
decoding the most (left) and the least significant 
(right) nibbles of the input byte, respectively, and 
generating respective left and right 16-bit strings 
each having only one active byte by decoding the input 
nibble . 

[0037] An array of 256 two inputs AND gates is 
provided and is input with bits of the 16-bit strings. 
Each AND gate is connected to a unique coit±)ination of 
the left and right decoder output lines. For instance, 
consider the internal line representing the Galois 
field element 0x63 (i.e., the byte 0x63). This line is 
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connected to the output of the AND gate, which in turn 
is connected to the 7^^ output line from the left 
decoder and to the 4*^^ output line of the right decoder. 
This internal line is active only when the left decoder 
receives the value ^6' and switches active its 7^^ 
output line, and when the right decoder receives the 
value ^3' and switches active its 4^^ output line. 
[0038] Given that the inputs of S-box may be 

considered random distributed byte values, it holds 
that each of the left and right decoder input lines has 
a switching probability of 1/2 over a long operating 
period. Each of the 16 left and right decoder output 
lines has a switching probability of 1/16 over a long 
operating period, and each of the 256 internal lines 
has a switching probability of 1/256 over a long 
working period. 

[0039] The internal structure of the left and right 
decoders is represented in Figure 4. This is a normal 
architecture for a priority-less decoder. An immediate 
advantage of the proposed architecture is that the 
maximum fan-out of nets is fixed to 16 gates, i.e., 
keeping low the fan-out of the nets leads to a low 
power consumption. Another advantage is that the 
latency of the decoder is equal to the latency of only 
one inverter gate and of two AND gates. 

[0040] The encoder component is more area consuming 
and brings more switching activity to the global 
circuit. Preferably, it is formed only of OR gates. 
Every encoder output line is a logic OR between exactly 
128 internal lines. For each output bit, the OR-tree 
is practically implemented by 3 levels of components. 
The first level is made of 16 eight-inputs OR gates, 
the second level is made of 2 eight-inputs OR gates. 



and the third level which provides the output bit is 
just a two-input OR gate. 

[0041] Such an architecture is commonly used for 
encoders. Although it is quite area consuming since the 
encoder has 256 input lines, it can be very fast 
because the latency is equal to that of 2 eight-input 
OR gates and 1 two-input OR gate. Moreover, the 
maximum number of switching gates consequent to a 
switching input line is fixed at 16 eight-inputs OR 
gates and at 8 two-inputs OR gates. This corresponds 
to the worst case when all the 8 output lines are 
switching at the same time. The maximum fan-out of the 
encoder is 8, and specifically it is the fan-out of the 
input line (1 out of 256) in correspondence of active 
bits of the output byte. 

[0042] The hardware device of the invention may be 
used for forming an S-box device for the Rijndael 
algorithm making the implemented one-to-one binary 
function be the function representing the ByteSub 
operation. Performances of a Decode-Encode (D&E) AES 
S-box architecture formed according to the scheme of 
Figure 3 starting from VHDL files will now be 
discussed. In this way, the validity of the proposed 
method for the efficient synthesis of one-to-one 
combinatorial networks is demonstrated. 
[0043] The synthesis of the AES S-box has been 
carried out using the design tool Synopsis Design 
Compiler, version 2001.08, and the HCCM0S7 (by 
STMicroelectronics) technology library featuring a 0.25 
Jim process operating at 1.8 V core voltage. Figure 5 
depicts power-delay diagrams from a purely behavioral 
description (S-box bhv) , a mixed structure (S-box D&E 
mixed) and a structural decode-encode description (S- 
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box D&E struct) . 

[0044] The purely behavioral description (S-box bhv) 
is where the S-box truth table is simply given to the 
design tool. This description is useful to test the 
compiler's capability of optimizing the circuit with 
built-in techniques. Note that the VHDL code would be 
the same in case a ROM was present. However, for 
relatively small structures like the S-box, the ROM 
table is often allocated as a combinational logic and 
optimized by the silicon compiler. 

[0045] The mixed structure (S-box D&E mixed) is 
where the decode-encode structure is explicitly given 
to the compiler, but the decoder and the encoder blocks 
are described in a behavioral mode. This description 
may be useful to test the capability of the compiler of 
using logic blocks inside the specific technology 
library to map these circuits. 

[0046] The structural decode-encode description (S- 
box D&E struct) is where the architecture described 
above is explicitly given to the compiler (VHDL is 
written with a structural approach) . The simplest 
components, such as AND and OR gates, are described in 
behavioral code and form the basis for the structural 
description. 

[0047] The proposed decode-encode S-box architecture 
(S-box D&E struct) behaves very well in the region 
between 1.6ns and 3ns time latency. Power consumption 
in this region is about 50% smaller than the power 
consumption of the S-box obtained with a purely 
behavioral description. Both architectures can reach a 
speed limit of 1.53ns using this specific technology 
library, but the proposed architecture is not worthy at 
full speed. It is interesting to note that the 
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compiler is not able to map automatically fast power- 
efficient encoders and decoders, as can be seen from 
the D&E mixed line. The performances of this 
architecture are better than those obtained with the 
purely behavioral approach, and this can be seen as a 
partial benefit of the decode-encode structure. 
[0048] The composite fields architecture has been 
coded as well, and the power delay figures are reported 
in Figure 6, together with a single mapping from the 
CHES2002 low-power implementation, which is basically a 
modification of the composite fields architecture. In 
the latter case it was not possible to model delay 
chains with the HCCMOS library, which are an essential 
part of the architecture. Results are expected to be 
better than the standard composite field case when 
delay chains are correctly modeled. This is true for 
power consumption, but delays will be very similar. As 
may be noticed from the graph, power-delay figures are 
worse for the composite fields architecture, which by 
contrast, is a convenient architecture as far the area 
occupation is concerned. 

[0049] Figure 7 compares minimum area occupation 

figures for the different S-box architectures . To 
prove the general validity of the method of the 
invention for implementing any one-to-one binary 
function, other cases are examined. If the values of 
the function implemented by the AES S-box have to be 
changed for security reasons, it is sufficient to 
change the arrangement of the internal lines of the S- 
box of the bit strings obtained by decoding the input 
byte without increasing the power consumption. 
[0050] Another possibility includes inserting 
multiplexers in the inner switch to support the direct 
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and inverse ByteSub simply with an additional control 
line. This will lead to important area savings, given 
that encryption and decryption operations are not used 
at the same time. 

[0051] It is possible to optimize the S-boxes of 

other different cryptographic algorithms as well. The 
decode-encode'' scheme is still applicable for any 
hardware device implementing an one-to-one binary 
function. For instance, the method of the invention 
may be applied to the DES (Data Encryption Standard) 
algorithm [14], as well as to the Kasumi algorithm" 

[15], which is the candidate encryption algorithm for 
the new UMTS infrastructure. 

[0052] Figure 8 shows the power consumption figures 
of the eight S-boxes for performing the DES algorithm. 
These components have been synthesized, first starting 
from a table description and then by the proposed 
method. The power saving is evident, and a further 
unexpected but favorable result includes that the eight 
S-boxes dissipate substantially the same power because 
they differ only in the way the wires are permuted. 
This is an important feature, because it allows ^'hot 
spots'' to be eliminated in the design and allows a more 
regular power consumption of the whole circuit- 
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